- Linear Regression
- Non-Linear Regression
- Logistic Regression
Linear Regression¶
Assumption¶
In [1]:
import sys
import numpy as np
import pandas as pd
sys.path.append(r'../../')
from utils.plots import plot_linear_plot
import plotly
plotly.offline.init_notebook_mode()
np.random.seed(42)  # np.random.seed returns None, so assigning its result serves no purpose
In [2]:
m = 10
b = 2
noise_level = 50
x = np.arange(start=0, stop=10, step=1)
noise = np.random.randint(low=0, high=noise_level, size=len(x))
y = m * x + b + noise
In [3]:
# first order polynomial - simple linear regression
coeff = np.polyfit(x=x, y=y, deg=1)
y_pred = (coeff[0] * x**1) + coeff[1]
plot_linear_plot(x, y, y_pred)
In [4]:
# third order polynomial
coeff = np.polyfit(x=x, y=y, deg=3)
y_pred = (coeff[0] * x**3) + (coeff[1] * x**2) + (coeff[2] * x**1) + coeff[3]
plot_linear_plot(x, y, y_pred)
In [5]:
# using numpy polyval function
coeff = np.polyfit(x=x, y=y, deg=8)
y_pred = np.round(np.polyval(coeff, x),2)
plot_linear_plot(x, y, y_pred)
Accuracy metrics¶
MAE (Mean Absolute Error)
- Pros:
- The MAE is expressed in the same unit as the output variable.
- Robust to outliers.
- Cons:
- Not differentiable at zero, which complicates its use as a loss function with gradient-based optimizers.
MSE (Mean Squared Error)
- Pros:
- Differentiable and can be used as a loss function.
- Cons:
- Output is in squared units.
- Not robust to outliers due to squared differences.
RMSE (Root Mean Squared Error)
- Pros:
- Output is in the same unit as the target variable.
RMSLE (Root Mean Squared Log Error)
- Pros:
- The logarithm damps large absolute errors, so the metric reflects relative rather than absolute error.
- Penalizes underestimation more heavily than overestimation, which is useful when under-prediction is the costlier mistake.
- Cons:
- Cannot be used when targets or predictions are negative.
MAPE (Mean Absolute Percentage Error)
- Pros:
- Scale-independent: expresses error as a percentage, so it reflects errors for both high- and low-magnitude values.
- Cons:
- Undefined when actual values are zero and unstable when they are near zero.
- Sensitive to outliers.
R2 (R-Squared) $$R^2 = 1 - \frac{{SS_{\text{res}}}}{{SS_{\text{tot}}}}$$
- Pros:
- Compares the regression line to a baseline mean-line model.
- Useful for model comparison.
- Measures the proportion of variance explained by the model: 1 is a perfect fit, 0 means no better than predicting the mean, and it can be negative for models worse than the mean.
- Cons:
- R2 never decreases when features are added, even irrelevant ones.
Adj R2 (Adjusted R-Squared) $$ \text{Adjusted R}^2 = 1 - \left(1 - R^2\right) \cdot \frac{{n - 1}}{{n - k - 1}} $$
- Pros:
- Penalizes model complexity, making it more reliable than R2 for comparing models with different numbers of features.
- Decreases when irrelevant features are added.
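The metrics above can be computed directly with NumPy. A minimal sketch, using hypothetical `y_true`/`y_pred` values chosen for illustration:

```python
import numpy as np

# hypothetical ground truth and predictions for illustration
y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0, 12.5])

mae = np.mean(np.abs(y_true - y_pred))                 # same units as y
mse = np.mean((y_true - y_pred) ** 2)                  # squared units
rmse = np.sqrt(mse)                                    # back to y's units
rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
mape = np.mean(np.abs((y_true - y_pred) / y_true))     # relative error

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R2 with n samples and k features
n, k = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Libraries such as scikit-learn provide equivalents of most of these, but the formulas are simple enough to verify by hand.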
Regularization¶
Regularization is important for managing the bias-variance trade-off, i.e., the balance between underfitting and overfitting.
Bias-Variance Trade-off:
- Polynomial regression aims to find a balance between bias (underfitting) and variance (overfitting).
- High-degree polynomials can fit the training data perfectly but may generalize poorly to unseen data (overfitting).
- Regularization helps control this trade-off.
Why Regularization?:
- When fitting polynomials, we often face a dilemma:
- Low-degree polynomials (e.g., linear or quadratic) may underfit the data.
- High-degree polynomials (e.g., cubic or higher) may overfit the data.
- Regularization provides a way to address this by introducing a penalty term.
Penalty Term:
- Regularization adds a penalty to the loss function.
- The total loss becomes: Loss = Loss Function + Penalty
- The penalty discourages large coefficients, preventing overfitting.
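The penalized loss can be written out in a few lines of NumPy. A minimal sketch with an L2 penalty, using synthetic data and a hypothetical coefficient vector `w`:

```python
import numpy as np

# synthetic data for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=20)

w = np.array([1.8, -0.9, 0.4])  # candidate coefficients
alpha = 0.1                     # regularization strength

mse_loss = np.mean((X @ w - y) ** 2)     # data-fit term
l2_penalty = alpha * np.sum(w ** 2)      # penalty term
total_loss = mse_loss + l2_penalty       # Loss = Loss Function + Penalty
```

Large coefficients inflate `l2_penalty`, so minimizing `total_loss` pushes the fit toward smaller, smoother coefficient vectors.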
Types of Regularization:
- L2 (Ridge) Regularization:
- Adds the sum of squared coefficients to the loss function.
- Encourages small coefficients.
- Helps prevent overfitting.
- L1 (Lasso) Regularization:
- Adds the sum of absolute coefficients to the loss function.
- Encourages sparse models (sets some coefficients to exactly zero).
- Useful for feature selection.
- Elastic Net Regularization:
- Combines L1 and L2 regularization.
- Balances between sparsity and smoothness.
Effect on Coefficients:
- Regularization shrinks the coefficients toward zero.
- Smaller coefficients lead to simpler models.
- It helps prevent overfitting by reducing the model's complexity.
Continuous Complexity Range:
- Regularization provides a continuous range of complexity parameters.
- Unlike choosing a fixed polynomial degree, you can fine-tune the regularization strength.
- This flexibility allows finding the right balance between bias and variance.
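The shrinkage effect can be seen with ridge regression's closed-form solution, w = (XᵀX + αI)⁻¹Xᵀy. A minimal NumPy sketch on synthetic data (a sparse `true_w` chosen for illustration), comparing two regularization strengths:

```python
import numpy as np

# synthetic data with a sparse true coefficient vector
rng = np.random.default_rng(0)
true_w = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
X = rng.normal(size=(50, 5))
y = X @ true_w + rng.normal(scale=0.5, size=50)

def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_small_alpha = ridge_fit(X, y, alpha=1.0)
w_large_alpha = ridge_fit(X, y, alpha=100.0)
# increasing alpha shrinks the coefficient vector toward zero
```

Lasso has no closed form (the L1 penalty is not differentiable at zero) and is typically fit with coordinate descent, e.g. via `sklearn.linear_model.Lasso`; unlike ridge, it can set coefficients exactly to zero.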
Methods to detect overfitting¶
- Visual Inspection: Plot the fitted line against the data points.
- Cross-Validation: Use techniques like k-fold cross-validation to assess model performance on unseen data.
- Learning Curves:
- Plot the model’s performance (e.g., accuracy or loss) against the size of the training dataset.
- If the training performance keeps improving while the validation performance plateaus or worsens, overfitting could be occurring.
- Feature Importance Analysis: If a few features dominate, it might indicate overfitting.
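The cross-validation check above can be sketched with NumPy alone. A minimal k-fold helper (`kfold_mse` is a hypothetical name), comparing a low- and a high-degree polynomial fit on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 40)
y = 2 * x + 3 + rng.normal(scale=2.0, size=len(x))

def kfold_mse(x, y, degree, k=5):
    """Mean validation MSE of a polynomial fit across k folds."""
    idx = rng.permutation(len(x))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # indices not in this fold
        coeff = np.polyfit(x[train], y[train], deg=degree)
        pred = np.polyval(coeff, x[fold])
        errors.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errors)

err_lin = kfold_mse(x, y, degree=1)
err_high = kfold_mse(x, y, degree=9)
# a much larger validation error for the high-degree fit signals overfitting
```

In practice `sklearn.model_selection.cross_val_score` does the fold bookkeeping for you; the point here is only that validation error, not training error, is what exposes overfitting.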